Statistical software and applications

Chris Brown, University of Sydney

60 min including questions / interaction

  • Questionnaire (5 mins)

  • Software (20 mins)

  • Statistical tests (30 mins)

  • Survey analysis / Questions (5 mins)

Welcome survey

Survey

Slides

Who am I

  • 1999 - 2002: BSc (Maths / Stats), University of Sydney

  • 2002 - 2005: SPSS (Technical Support)

  • 2004 - 2007: Masters Biostatistics

  • 2005 - Now: Clinical Trials Centre, University of Sydney

  • 2013 - 2015: Cancer Registry Ireland

  • 2016 - Now: Bean Bar You (Chocolate Subscription)

Survey results

Currently downloading and processing your results…

library(redcapAPI)

# Read the API token from API.txt (first field of the first line)
api_key <- read.csv("API.txt", header = FALSE, stringsAsFactors = FALSE)

# Connect to REDCap via the API
rcon <- redcapConnection(
  url = "https://redcap.sydney.edu.au/api/",
  token = api_key[1, 1]
)

# Download current responses
data <- exportRecordsTyped(rcon)

Survey results - Statistics?

Survey results - Software?

Characteristic   N = 16 [1]
SAS              0 (0%)
SPSS             16 (100%)
Stata            3 (19%)
R                3 (19%)
Python           0 (0%)
Git              0 (0%)
REDCap           0 (0%)
Other            0 (0%)

[1] n (%)

Survey results - Focus of this talk?

Survey results - Time to complete

Statistical Software

  • SAS

  • SPSS

  • Stata

  • R

  • Python

Other common tools:

  • REDCap

  • Git

Popularity over time

https://www.kdnuggets.com/2010/06/software-popularity-of-data-analysis-software.html

SAS

https://www.sas.com/

  • Powerful / reliable / just works

  • Was the “standard” for the pharmaceutical industry

  • Driven with code (programming)

  • Good sample size program (but is complicated)

  • Expensive / often available via Universities

SAS screenshot

SPSS

https://www.ibm.com/products/spss-statistics

  • GUI (use menus and mouse to set up analyses)
  • Code option (SPSS Syntax) - Replicate your analysis!!!
  • Python integration (generate / interact with datasets)
  • Base + Add-Ins. ~$150-300 USD/y. Student discounts

SPSS sample dataset

SPSS screenshot

SPSS data view

Stata

https://www.stata.com/

Stata screenshot

R

https://cran.r-project.org/

  • Open source (an implementation of the S language, also used by S-PLUS) = Free

  • Code based (People have created GUIs)

  • Huge community, great integrations

  • RStudio (Posit) fostered the “Tidyverse”

  • Packages / can get messy / easy to break things

  • Markdown (Quarto) / Shiny = Game changers!

RStudio screenshot

R Reproducible research

R Shiny (Interactive)

https://shiny.posit.co/

Python

https://www.python.org/

  • Not a traditional statistical software package

  • Pandas / NumPy / Jupyter notebooks -> data science

  • Powerful, open source, huge community, AI

  • Open-source = FREE

  • Packages / can get messy / easy to break things

Python - VSCode

Git

https://git-scm.com/

  • Version control text files (i.e. code)

  • Record changes over time + explanation of why

  • Able to roll back to an old version

  • Very useful if collaborating with others

Git - Problem

Git - Solution

Git - Benefits

  • Backup and restore
  • Synchronization
  • Short-term / long-term undo
  • Track changes
  • Track ownership
  • Sandboxing
  • Branching and merging

https://betterexplained.com/articles/a-visual-guide-to-version-control/

REDCap

https://project-redcap.org/

  • Open source database system

REDCap (example)

Statistical tests

                    H0 true                      H1 true
Fail to reject H0   Correct                      Type 2 error (\(\beta\))
Reject H0           Type 1 error (\(\alpha\))    Correct (Power = 1-\(\beta\))

Randomisation

https://heyspinner.com/random-number-wheel/1-2

Types of data

  • Binary

  • Categorical (ordered/unordered)

  • Continuous

  • Time-to-event

Common tests

  • Comparing proportions

  • Comparing continuous

  • Comparing time-to-event

Example dataset

From R’s “survival” package: “cancer” dataset

Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.

Example dataset - Variables

Variable    Description
inst        Institution code
time        Survival time in days
status      Censoring status (1 = censored, 2 = dead)
age         Age in years
sex         Male = 1, Female = 2
ph.ecog     ECOG performance score as rated by the physician
ph.karno    Karnofsky performance score (KPS; bad = 0, good = 100) as rated by the physician
pat.karno   KPS as rated by the patient
meal.cal    Calories consumed at meals
wt.loss     Weight loss in last six months (pounds)

Proportions

Power


     Two-sample comparison of proportions power calculation 

              n = 14
             p1 = 0.2
             p2 = 0.7051759
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
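Output in this format is what base R's power.prop.test() prints. The call behind the slide is not shown, so the following is only a sketch, assuming n per group, p1 and the power were fixed and p2 was left for the function to solve:

```r
# Assumed call (not shown in the original slides): fix n per group,
# the first proportion and the power, then solve for p2.
res <- power.prop.test(n = 14, p1 = 0.2, power = 0.8, sig.level = 0.05)
print(res)

# Smallest second-group proportion detectable with 80% power
res$p2
```

Leaving a different argument out (for example n instead of p2) makes the same function solve for that quantity instead.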

Proportions in R

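library(gtsummary)  # provides tbl_cross() and add_p()

# cancer_clean: the prepared dataset used in these slides (preparation not shown)
tbl_cross(cancer_clean,
          row = Event,
          col = Sex,
          percent = "column"
) %>%
  add_p()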

                 Sex
Event            Male          Female        Total         p-value [1]
                                                           <0.001
    Died         112 (81%)     53 (59%)      165 (72%)
    Censored     26 (19%)      37 (41%)      63 (28%)
Total            138 (100%)    90 (100%)     228 (100%)

[1] Pearson’s Chi-squared test
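The same comparison can be sketched in base R, working directly from the “cancer” dataset in the survival package rather than the prepared cancer_clean data. The factor labels and the choice of correct = FALSE (no continuity correction) are assumptions made for this illustration:

```r
library(survival)  # provides the cancer dataset

# Cross-tabulate event status (1 = censored, 2 = dead) by sex (1 = male, 2 = female)
tab <- table(
  Event = factor(cancer$status, levels = c(2, 1), labels = c("Died", "Censored")),
  Sex   = factor(cancer$sex, levels = c(1, 2), labels = c("Male", "Female"))
)
tab

# Column percentages (percentage of each sex who died / were censored)
round(100 * prop.table(tab, margin = 2))

# Pearson's chi-squared test without continuity correction (assumption)
res <- chisq.test(tab, correct = FALSE)
res$p.value
```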

Proportions in SPSS

Proportions in SPSS (tidy)

Proportions in Stata

Continuous

T-Test in R

tbl_summary(cancer_clean,
            include = c(age, meal.cal, wt.loss),
            statistic = list(all_continuous() ~ "{mean} ({sd})"),
            by = sex
) %>%
  add_p(test = list(all_continuous() ~ "t.test"))

Characteristic   1, N = 138 [1]   2, N = 90 [1]   p-value [2]
age              63 (9)           61 (9)          0.064
meal.cal         981 (413)        841 (369)       0.020
    Unknown      24               23
wt.loss          11 (13)          8 (13)          0.060
    Unknown      10               4

[1] Mean (SD)
[2] Welch Two Sample t-test
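A single row of this comparison can be sketched with base R's t.test(), which runs the Welch (unequal variance) version by default. This uses the “cancer” dataset from the survival package directly, which is assumed to hold the same age values as cancer_clean:

```r
library(survival)  # provides the cancer dataset

# Welch two-sample t-test of age by sex (1 = male, 2 = female)
res <- t.test(age ~ sex, data = cancer)
res

# Group means and the p-value
res$estimate
res$p.value
```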

T-Test in SPSS

T-Test in Stata

Time to Event

Time to event - Data

Kaplan Meier

library(survival)
# Kaplan-Meier estimate of overall survival (no grouping)
fit <- survfit(Surv(time, status) ~ 1, data = cancer)
plot(fit)

Kaplan Meier - Stata

Kaplan Meier - SPSS

Log-rank test
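A minimal sketch of a log-rank test in R, assuming the comparison of interest is survival by sex in the “cancer” dataset (the slide itself does not show the call):

```r
library(survival)

# Log-rank test comparing survival between the sexes
lr <- survdiff(Surv(time, status) ~ sex, data = cancer)
lr

# p-value from the chi-squared statistic on 1 degree of freedom (two groups)
p_value <- 1 - pchisq(lr$chisq, df = 1)
p_value
```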

Median follow-up

Competing risks

Competing risk - CI

Modelling

  • Regression

    • UV

    • MV

    • Repeated measures / clustering

  • Time to event

    • Cox-proportional hazards

    • Competing risks

Cox Regression - Hazards

Proportional hazards

  • Hazard Ratio: The average ratio of these two lines (over the whole period)

Proportional hazards

  • ‘Baseline hazard’ can take any form (i.e. doesn’t have to be constant / smooth)

Non-proportional hazards

  • Can appear as curves “crossing” or “diverging”

  • If so, a single number (hazard ratio) may not be the appropriate summary

Non-proportional hazards

Sub-groups

Cox regression in R

cox_model <- coxph(Surv(time, status) ~ sex, data = cancer)
summary(cox_model)
Call:
coxph(formula = Surv(time, status) ~ sex, data = cancer)

  n= 228, number of events= 165 

       coef exp(coef) se(coef)      z Pr(>|z|)   
sex -0.5310    0.5880   0.1672 -3.176  0.00149 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    exp(coef) exp(-coef) lower .95 upper .95
sex     0.588      1.701    0.4237     0.816

Concordance= 0.579  (se = 0.021 )
Likelihood ratio test= 10.63  on 1 df,   p=0.001
Wald test            = 10.09  on 1 df,   p=0.001
Score (logrank) test = 10.33  on 1 df,   p=0.001

Cox regression in R

tbl_regression(cox_model,
               exponentiate = TRUE) %>% 
  add_global_p()

Characteristic   HR [1]   95% CI [1]    p-value
sex              0.59     0.42, 0.82    0.001

[1] HR = Hazard Ratio, CI = Confidence Interval

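# ggcoxzph() comes from the survminer package
test.ph <- cox.zph(cox_model)  # test the proportional hazards assumption
ggcoxzph(test.ph)              # plot the scaled Schoenfeld residuals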

Time dependent Cox

Univariable models

Characteristic   N     HR [1]   95% CI [1]    p-value
sex              228                          0.001
    male
    female             0.59     0.42, 0.82
age              228   1.02     1.00, 1.04    0.039
age_10           228   1.21     1.01, 1.44    0.039
ph.ecog          227   1.61     1.29, 2.01    <0.001
ecog             227                          <0.001
    0
    1                  1.45     0.98, 2.13
    2                  2.50     1.61, 3.88
    3/4                9.10     1.22, 67.9
ph.karno_10      227   0.85     0.76, 0.95    0.006
meal.cal_1000    181   0.88     0.56, 1.39    0.6
wt.loss_10       214   1.01     0.90, 1.14    0.8

[1] HR = Hazard Ratio, CI = Confidence Interval

  • age vs age_10?
  • ph.ecog vs ecog?
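The age vs age_10 question can be explored with a short sketch. It assumes age_10 is simply age divided by 10 (age in decades); the slides do not show how it was derived:

```r
library(survival)

dat <- cancer
dat$age_10 <- dat$age / 10  # age in decades (assumed definition of age_10)

fit_year   <- coxph(Surv(time, status) ~ age, data = dat)
fit_decade <- coxph(Surv(time, status) ~ age_10, data = dat)

hr_year   <- unname(exp(coef(fit_year)))
hr_decade <- unname(exp(coef(fit_decade)))

# Same model in different units: HR per decade equals (HR per year)^10
c(per_year = hr_year, per_decade = hr_decade, per_year_pow10 = hr_year^10)
```

The p-value is identical in both fits; only the scale of the hazard ratio changes, which can make the effect size easier to read.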

Multivariable models

Characteristic   HR [1]   95% CI [1]    p-value
sex
    male
    female       0.58     0.39, 0.85    0.006
age_10           1.13     0.90, 1.42    0.3
ecog
    0
    1            1.85     1.07, 3.20    0.026
    2            4.81     2.06, 11.2    <0.001
    3/4          16.0     1.78, 143     0.013
ph.karno_10      1.22     0.98, 1.52    0.076
meal.cal_1000    0.97     0.58, 1.60    0.9
wt.loss_10       0.89     0.76, 1.03    0.12

[1] HR = Hazard Ratio, CI = Confidence Interval

Model selection

  • Consider your context and objective

  • Backwards / forwards selection

  • Best subset

  • LASSO regression

Want to learn more? I suggest Frank Harrell, Regression Modeling Strategies: https://hbiostat.org/rmsc/

Repeated measures / clustering

  • Mixed effects logistic regression

  • Generalised estimating equations (GEE) 

Katherine E. Francis, Reporting the trajectories of adverse events over the entire treatment course….

Summary

  • Use any package (one you can get support with)

  • Have a reproducible mindset (make life easy)

  • Use version control (keep things tidy)

  • Write comments for “future you” / others

GitHub Copilot

https://docs.github.com/en/copilot

Reproducible mindset

Aim to be able to run from source program to output report without manual intervention:

  • You won’t forget how to run / update things (+5 years)

  • Someone else can run it if they want to

  • You can’t make copy/paste / typing errors

  • Automatic updates (if the data changes)

My poster: Embedding reproducible research principles in clinical trial analyses

Survey part 2

Please complete the 2nd part of the survey now… I really appreciate your feedback (use the QR code only if you have lost the page)

ACORD 2026

Want to develop a trial idea into a full protocol in 6 days?

Consider an ACORD protocol development workshop

https://www.moga.org.au/2026-acord-workshop

Thank you

Any questions?